<!DOCTYPE html>
<html>
<head>
    

    

    



    <meta charset="utf-8">
    
    
    
    <title>Scrapy Learning (2): Getting Started | 四畳半神话大系 | 圈地自萌</title>
    <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
    
    <meta name="theme-color" content="#3F51B5">
    
    
    <meta name="keywords" content="Learning,Scrapy">
    <meta name="description" content="Quick start: following the previous post, Scrapy Learning (1): Installation, we use a simple example to get familiar with creating a crawler project with Scrapy.">
<meta property="og:type" content="article">
<meta property="og:title" content="Scrapy Learning (2): Getting Started">
<meta property="og:url" content="http://amoyiki.com/2017/01/30/Scrapy学习（二）-入门/index.html">
<meta property="og:site_name" content="四畳半神话大系">
<meta property="og:description" content="Quick start: following the previous post, Scrapy Learning (1): Installation, we use a simple example to get familiar with creating a crawler project with Scrapy.">
<meta property="og:updated_time" content="2017-02-04T07:08:23.258Z">
<meta name="twitter:card" content="summary">
<meta name="twitter:title" content="Scrapy Learning (2): Getting Started">
<meta name="twitter:description" content="Quick start: following the previous post, Scrapy Learning (1): Installation, we use a simple example to get familiar with creating a crawler project with Scrapy.">
    
        <link rel="alternate" href="/atom.xml" title="四畳半神话大系" type="application/atom+xml">
    
    <link rel="shortcut icon" href="/favicon.ico">
    <link rel="stylesheet" href="/css/style.css?v=1.4.14">
    <script>window.lazyScripts=[]</script>
</head>

<body>
    <div id="loading" class="active"></div>

    <aside id="menu" class="hide" >
  <div class="inner flex-row-vertical">
    <a href="javascript:;" class="header-icon waves-effect waves-circle waves-light" id="menu-off">
        <i class="icon icon-lg icon-close"></i>
    </a>
    <div class="brand-wrap">
      <div class="brand">
        <a href="/" class="avatar waves-effect waves-circle waves-light">
          <img src="/images/avatar.jpg">
        </a>
        <hgroup class="introduce">
          <h5 class="nickname">amoyiki</h5>
          <a href="mailto:amoyiki#gmail.com" title="amoyiki#gmail.com" class="mail">amoyiki#gmail.com</a>
        </hgroup>
      </div>
    </div>
    <div class="scroll-wrap flex-col">
      <ul class="nav">
        
            <li class="waves-block waves-effect">
              <a href="/"  >
                <i class="icon icon-lg icon-home"></i>
                Home
              </a>
            </li>
        
            <li class="waves-block waves-effect">
              <a href="/archives"  >
                <i class="icon icon-lg icon-archives"></i>
                Archives
              </a>
            </li>
        
            <li class="waves-block waves-effect">
              <a href="/tags"  >
                <i class="icon icon-lg icon-tags"></i>
                Tags
              </a>
            </li>
        
            <li class="waves-block waves-effect">
              <a href="/categories"  >
                <i class="icon icon-lg icon-th-list"></i>
                Categories
              </a>
            </li>
        
            <li class="waves-block waves-effect">
              <a href="https://github.com/amoyiki" target="_blank" >
                <i class="icon icon-lg icon-github"></i>
                Github
              </a>
            </li>
        
            <li class="waves-block waves-effect">
              <a href="/about"  >
                <i class="icon icon-lg icon-user-circle"></i>
                About
              </a>
            </li>
        
      </ul>
    </div>
  </div>
</aside>

    <main id="main">
        <header class="top-header" id="header">
    <div class="flex-row">
        <a href="javascript:;" class="header-icon waves-effect waves-circle waves-light on" id="menu-toggle">
          <i class="icon icon-lg icon-navicon"></i>
        </a>
        <div class="flex-col header-title ellipsis">Scrapy Learning (2): Getting Started</div>
        
        <div class="search-wrap" id="search-wrap">
            <a href="javascript:;" class="header-icon waves-effect waves-circle waves-light" id="back">
                <i class="icon icon-lg icon-chevron-left"></i>
            </a>
            <input type="text" id="key" class="search-input" autocomplete="off" placeholder="Enter a keyword you are interested in">
            <a href="javascript:;" class="header-icon waves-effect waves-circle waves-light" id="search">
                <i class="icon icon-lg icon-search"></i>
            </a>
        </div>
        
        
        <a href="javascript:;" class="header-icon waves-effect waves-circle waves-light" id="menuShare">
            <i class="icon icon-lg icon-share-alt"></i>
        </a>
        
    </div>
</header>
<header class="content-header post-header">

    <div class="container fade-scale">
        <h1 class="title">Scrapy Learning (2): Getting Started</h1>
        <h5 class="subtitle">
            
                <time datetime="2017-01-30T14:12:13.000Z" itemprop="datePublished" class="page-time">
  2017-01-30
</time>


	<ul class="article-category-list"><li class="article-category-list-item"><a class="article-category-list-link" href="/categories/Python/">Python</a></li></ul>

            
        </h5>
    </div>

    

</header>


<div class="container body-wrap">
    
    <aside class="post-widget">
        <nav class="post-toc-wrap" id="post-toc">
            <h4>TOC</h4>
            <ol class="post-toc"><li class="post-toc-item post-toc-level-2"><a class="post-toc-link" href="#快速入门"><span class="post-toc-number">1.</span> <span class="post-toc-text">Quick Start</span></a><ol class="post-toc-child"><li class="post-toc-item post-toc-level-3"><a class="post-toc-link" href="#创建一个Scrapy项目"><span class="post-toc-number">1.1.</span> <span class="post-toc-text">Creating a Scrapy project</span></a></li><li class="post-toc-item post-toc-level-3"><a class="post-toc-link" href="#编写items"><span class="post-toc-number">1.2.</span> <span class="post-toc-text">Writing items</span></a></li><li class="post-toc-item post-toc-level-3"><a class="post-toc-link" href="#编写Spider"><span class="post-toc-number">1.3.</span> <span class="post-toc-text">Writing the Spider</span></a></li><li class="post-toc-item post-toc-level-3"><a class="post-toc-link" href="#运行并保存数据"><span class="post-toc-number">1.4.</span> <span class="post-toc-text">Running and saving the data</span></a></li><li class="post-toc-item post-toc-level-3"><a class="post-toc-link" href="#其他"><span class="post-toc-number">1.5.</span> <span class="post-toc-text">Miscellaneous</span></a></li></ol></li></ol>
        </nav>
    </aside>
    
<article id="post-Scrapy学习（二）-入门"
  class="post-article article-type-post fade" itemprop="blogPost">

    <div class="post-card">
        <h1 class="post-card-title">Scrapy Learning (2): Getting Started</h1>
        <div class="post-meta">
            <time class="post-time" title="2017-01-30 22:12" datetime="2017-01-30T14:12:13.000Z"  itemprop="datePublished">2017-01-30</time>

            
	<ul class="article-category-list"><li class="article-category-list-item"><a class="article-category-list-link" href="/categories/Python/">Python</a></li></ul>



            
<span id="busuanzi_container_page_pv" title="Total page views" style='display:none'>
    <i class="icon icon-eye icon-pr"></i><span id="busuanzi_value_page_pv"></span>
</span>


            

        </div>
        <div class="post-content" id="post-content" itemprop="postContent">
            <h2 id="快速入门"><a href="#快速入门" class="headerlink" title="Quick Start"></a>Quick Start</h2><p>Following the previous post, <a href="http://www.amoyiki.com/2017/01/29/Scrapy%E5%AD%A6%E4%B9%A0%EF%BC%88%E4%B8%80%EF%BC%89-%E5%AE%89%E8%A3%85/" target="_blank" rel="external">Scrapy Learning (1): Installation</a>, we now use a simple example to get familiar with creating a crawler project with Scrapy.<br><a id="more"></a></p>
<h3 id="创建一个Scrapy项目"><a href="#创建一个Scrapy项目" class="headerlink" title="Creating a Scrapy project"></a>Creating a Scrapy project</h3><p>In the environment you have already set up, run:</p>
<blockquote>
<p>scrapy startproject dmoz</p>
</blockquote>
<p>Scrapy generates a project directory named <code>dmoz</code> in the current directory, with the following structure:<br><figure class="highlight plain"><table><tr><td class="code"><pre><span class="line">dmoz/    # project root directory</span><br><span class="line">    scrapy.cfg    # project configuration file</span><br><span class="line">    dmoz/    # project module</span><br><span class="line">        __init__.py</span><br><span class="line">        items.py    # item definitions, somewhat like models in Django</span><br><span class="line">        pipelines.py    # pipelines, responsible for processing and storing data</span><br><span class="line">        settings.py    # project settings</span><br><span class="line">        spiders/    # spider directory; all spider scripts go here</span><br><span class="line">            __init__.py</span><br></pre></td></tr></table></figure></p>
<p>Next, we take <code>dmoz.org</code> as the crawl target and start writing a simple crawler project.</p>
<h3 id="编写items"><a href="#编写items" class="headerlink" title="Writing items"></a>Writing items</h3><p>In <code>items.py</code>, define the model for the data we need:<br><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">from</span> scrapy.item <span class="keyword">import</span> Item, Field</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">Website</span><span class="params">(Item)</span>:</span></span><br><span class="line">    <span class="comment"># One Field per value we want to scrape for each listed site.</span></span><br><span class="line">    name = Field()</span><br><span class="line">    description = Field()</span><br><span class="line">    url = Field()</span><br></pre></td></tr></table></figure></p>
<p>This model holds the data we scrape.</p>
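<h3 id="编写Spider"><a href="#编写Spider" class="headerlink" title="Writing the Spider"></a>Writing the Spider</h3><p>Create a new spider file in the <code>spiders</code> directory. This is the core of the crawler's logic.<br>First create a class that inherits from <code>scrapy.spiders.Spider</code>,<br>then define the following two attributes and one method:</p>
<ul>
<li>name: identifies the spider</li>
<li>start_urls: the list of URLs to crawl when the spider starts; empty by default</li>
<li>parse(): the response downloaded for each start URL is passed to this method, where the data can be processed.</li>
</ul>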
<figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">from</span> scrapy.spiders <span class="keyword">import</span> Spider</span><br><span class="line"></span><br><span class="line"><span class="comment"># The project created above is named dmoz, so the items module is dmoz.items.</span></span><br><span class="line"><span class="keyword">from</span> dmoz.items <span class="keyword">import</span> Website</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">DmozSpider</span><span class="params">(Spider)</span>:</span></span><br><span class="line">    <span class="comment"># Must match the name passed to `scrapy crawl`.</span></span><br><span class="line">    name = <span class="string">"dmoz"</span></span><br><span class="line">    allowed_domains = [<span class="string">"dmoz.org"</span>]</span><br><span class="line">    start_urls = [</span><br><span class="line">        <span class="string">"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"</span>,</span><br><span class="line">        <span class="string">"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"</span>,</span><br><span class="line">    ]</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">parse</span><span class="params">(self, response)</span>:</span></span><br><span class="line">        <span class="comment"># Each matched node describes one listed site.</span></span><br><span class="line">        sites = response.css(<span class="string">'#site-list-content &gt; div.site-item &gt; div.title-and-desc'</span>)</span><br><span class="line">        items = []</span><br><span class="line"></span><br><span class="line">        <span class="keyword">for</span> site <span class="keyword">in</span> sites:</span><br><span class="line">            item = Website()</span><br><span class="line">            <span class="comment"># default='' avoids calling strip() on None when a node is missing.</span></span><br><span class="line">            item[<span class="string">'name'</span>] = site.css(</span><br><span class="line">                <span class="string">'a &gt; div.site-title::text'</span>).extract_first(default=<span class="string">''</span>).strip()</span><br><span class="line">            item[<span class="string">'url'</span>] = site.xpath(</span><br><span class="line">                <span class="string">'a/@href'</span>).extract_first(default=<span class="string">''</span>).strip()</span><br><span class="line">            item[<span class="string">'description'</span>] = site.css(</span><br><span class="line">                <span class="string">'div.site-descr::text'</span>).extract_first(default=<span class="string">''</span>).strip()</span><br><span class="line">            items.append(item)</span><br><span class="line">        <span class="keyword">return</span> items</span></pre></td></tr></table></figure>
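<p>Note that inside the <code>parse</code> method we can use selectors to extract the data we need from the page. There are several ways to extract it:</p>
<ul>
<li>xpath(): takes an XPath expression and returns the matching nodes</li>
<li>css(): takes a CSS expression and returns the matching nodes</li>
<li>re(): takes a regular expression and returns the matching strings (I have not tested this method myself)</li>
</ul>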
<h3 id="运行并保存数据"><a href="#运行并保存数据" class="headerlink" title="Running and saving the data"></a>Running and saving the data</h3><p>Next we run the spider and store the scraped data in a JSON file:</p>
<blockquote>
<p>scrapy crawl dmoz -o items.json</p>
</blockquote>
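<p>Once the crawl has finished, the exported file can be read back with the standard library. The sketch below assumes the export is a JSON array of objects with the three fields of the <code>Website</code> item; so that it runs on its own, it first writes a made-up sample file, and <code>load_items</code> is a hypothetical helper name.</p>

```python
import json
import os
import tempfile


def load_items(path):
    """Read an exported JSON feed into a list of dicts."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical sample standing in for a real items.json export.
sample = [
    {"name": "Example Book", "url": "http://example.com/", "description": "Placeholder"},
]
path = os.path.join(tempfile.mkdtemp(), "items.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(sample, fh)

items = load_items(path)
names = [item["name"] for item in items]
print(names)
```

<p>With a real crawl, pass the path of the generated <code>items.json</code> to <code>load_items</code> instead of the temporary sample file.</p>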
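<h3 id="其他"><a href="#其他" class="headerlink" title="Miscellaneous"></a>Miscellaneous</h3><p>While running the spider, I hit the following error:</p>
<blockquote>
<p>KeyError: 'Spider not found: dmoz'</p>
</blockquote>
<p>This happened because the <code>name</code> set in my spider class did not match the spider name I passed to <code>scrapy crawl</code>.</p>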
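<p>To make the cause of this error concrete, here is a deliberately simplified model of a name-based lookup. It is not Scrapy's actual loader code; <code>build_registry</code> and <code>find_spider</code> are hypothetical helpers, and the spider class is a plain stand-in so the snippet runs without Scrapy.</p>

```python
class DmozSpider:
    # Stand-in for a real spider; only the name attribute matters here.
    name = "dmoz"


def build_registry(spider_classes):
    """Map each spider's name attribute to its class."""
    return {cls.name: cls for cls in spider_classes}


def find_spider(registry, spider_name):
    """Look a spider up by name, raising KeyError when no name matches."""
    try:
        return registry[spider_name]
    except KeyError:
        raise KeyError("Spider not found: {}".format(spider_name))


registry = build_registry([DmozSpider])

# The lookup succeeds only when the requested name equals the name attribute.
found = find_spider(registry, "dmoz")

# A different name, such as a file or class name, fails.
try:
    find_spider(registry, "dmoz_spider")
    error_message = None
except KeyError as exc:
    error_message = exc.args[0]
print(error_message)
```

<p>The file name and the class name play no part in the lookup; only the <code>name</code> attribute does. Running <code>scrapy list</code> inside the project prints the spider names that are available to <code>scrapy crawl</code>.</p>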
<p>For the full code, see:<br><a href="https://github.com/amoyiki/LearnedAndProTest/tree/master/dirbot" target="_blank" rel="external">Scrapy getting-started project</a></p>

        </div>

        <blockquote class="post-copyright">
    <div class="content">
        
<span class="post-time">
    Last updated: <time datetime="2017-02-04T07:08:23.258Z" itemprop="dateUpdated">2017-02-04 15:08</time>
</span><br>


        Permalink: <a href="/2017/01/30/Scrapy学习（二）-入门/" target="_blank" rel="external">http://amoyiki.com/2017/01/30/Scrapy学习（二）-入门/</a>
    </div>
    <footer>
        <a href="http://amoyiki.com">
            <img src="/images/avatar.jpg" alt="amoyiki">
            amoyiki
        </a>
    </footer>
</blockquote>

        
<div class="page-reward">
    <a id="rewardBtn" href="javascript:;" class="page-reward-btn waves-effect waves-circle waves-light">Tip</a>
</div>



        <div class="post-footer">
            
	<ul class="article-tag-list"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/Scrapy/">Scrapy</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/学习/">Learning</a></li></ul>


            
<div class="page-share-wrap">
    

<div class="page-share" id="pageShare">
    <ul class="reset share-icons">
      <li>
        <a class="weibo share-sns" target="_blank" href="http://service.weibo.com/share/share.php?url=http://amoyiki.com/2017/01/30/Scrapy学习（二）-入门/&title=《Scrapy学习（二） 入门》 — 四畳半神话大系&pic=http://amoyiki.com/images/avatar.jpg" data-title="Weibo">
          <i class="icon icon-weibo"></i>
        </a>
      </li>
      <li>
        <a class="weixin share-sns wxFab" href="javascript:;" data-title="WeChat">
          <i class="icon icon-weixin"></i>
        </a>
      </li>
      <li>
        <a class="qq share-sns" target="_blank" href="http://connect.qq.com/widget/shareqq/index.html?url=http://amoyiki.com/2017/01/30/Scrapy学习（二）-入门/&title=《Scrapy学习（二） 入门》 — 四畳半神话大系&source=快速入门接上篇Scrapy学习（一） 安装，安装后，我们利用一个简单的例子来熟悉如何使用Scrapy创建一个爬虫项目。" data-title=" QQ">
          <i class="icon icon-qq"></i>
        </a>
      </li>
      <li>
        <a class="facebook share-sns" target="_blank" href="https://www.facebook.com/sharer/sharer.php?u=http://amoyiki.com/2017/01/30/Scrapy学习（二）-入门/" data-title=" Facebook">
          <i class="icon icon-facebook"></i>
        </a>
      </li>
      <li>
        <a class="twitter share-sns" target="_blank" href="https://twitter.com/intent/tweet?text=《Scrapy学习（二） 入门》 — 四畳半神话大系&url=http://amoyiki.com/2017/01/30/Scrapy学习（二）-入门/&via=http://amoyiki.com" data-title=" Twitter">
          <i class="icon icon-twitter"></i>
        </a>
      </li>
      <li>
        <a class="google share-sns" target="_blank" href="https://plus.google.com/share?url=http://amoyiki.com/2017/01/30/Scrapy学习（二）-入门/" data-title=" Google+">
          <i class="icon icon-google-plus"></i>
        </a>
      </li>
    </ul>
 </div>



    <a href="javascript:;" id="shareFab" class="page-share-fab waves-effect waves-circle">
        <i class="icon icon-share-alt icon-lg"></i>
    </a>
</div>



        </div>
    </div>

    
<nav class="post-nav flex-row flex-justify-between">
  
    <div class="waves-block waves-effect prev">
      <a href="/2017/02/04/Scrapy学习（三）-爬取豆瓣图书信息/" id="post-prev" class="post-nav-link">
        <div class="tips"><i class="icon icon-angle-left icon-lg icon-pr"></i> Prev</div>
        <h4 class="title">Scrapy Learning (3): Scraping Douban Book Information</h4>
      </a>
    </div>
  

  
    <div class="waves-block waves-effect next">
      <a href="/2017/01/29/Scrapy学习（一）-安装/" id="post-next" class="post-nav-link">
        <div class="tips">Next <i class="icon icon-angle-right icon-lg icon-pl"></i></div>
        <h4 class="title">Scrapy Learning (1): Installation</h4>
      </a>
    </div>
  
</nav>



    




<section class="comments" id="comments">
    <div id="disqus_thread"></div>
    <script>
    var disqus_shortname = 'amoyiki';
    lazyScripts.push('//' + disqus_shortname + '.disqus.com/embed.js')
    </script>
    <noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
</section>




</article>

<div id="reward" class="page-modal reward-lay">
    <a class="close" href="javascript:;"><i class="icon icon-close"></i></a>
    <h3 class="reward-title">
        <i class="icon icon-quote-left"></i>
        Thank you kindly~
        <i class="icon icon-quote-right"></i>
    </h3>
    <ul class="reward-items">
        

        
    </ul>
</div>



</div>

        <footer class="footer">
    <div class="top">
        
<p>
    <span id="busuanzi_container_site_uv" style='display:none'>
        Total site visitors: <span id="busuanzi_value_site_uv"></span>
    </span>
    <span id="busuanzi_container_site_pv" style='display:none'>
        Total site page views: <span id="busuanzi_value_site_pv"></span>
    </span>
</p>


        <p>
            <span><a href="/atom.xml" target="_blank" class="rss" title="rss"><i class="icon icon-lg icon-rss"></i></a></span>
            <span>Blog content is licensed under <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/" target="_blank">Creative Commons Attribution-NonCommercial-ShareAlike 4.0</a></span>
        </p>
    </div>
    <div class="bottom">
        <p>
            <span>Powered by <a href="http://hexo.io/" target="_blank">Hexo</a> Theme <a href="https://github.com/yscoder/hexo-theme-indigo" target="_blank">indigo</a></span>
            <span>四畳半神话大系 &copy; 2015 - 2017</span>
        </p>
    </div>
</footer>

    </main>
    <div class="mask" id="mask"></div>
<a href="javascript:;" id="gotop" class="waves-effect waves-circle waves-light"><span class="icon icon-lg icon-chevron-up"></span></a>



<div class="global-share" id="globalShare">
    <ul class="reset share-icons">
      <li>
        <a class="weibo share-sns" target="_blank" href="http://service.weibo.com/share/share.php?url=http://amoyiki.com/2017/01/30/Scrapy学习（二）-入门/&title=《Scrapy学习（二） 入门》 — 四畳半神话大系&pic=http://amoyiki.com/images/avatar.jpg" data-title="Weibo">
          <i class="icon icon-weibo"></i>
        </a>
      </li>
      <li>
        <a class="weixin share-sns wxFab" href="javascript:;" data-title="WeChat">
          <i class="icon icon-weixin"></i>
        </a>
      </li>
      <li>
        <a class="qq share-sns" target="_blank" href="http://connect.qq.com/widget/shareqq/index.html?url=http://amoyiki.com/2017/01/30/Scrapy学习（二）-入门/&title=《Scrapy学习（二） 入门》 — 四畳半神话大系&source=快速入门接上篇Scrapy学习（一） 安装，安装后，我们利用一个简单的例子来熟悉如何使用Scrapy创建一个爬虫项目。" data-title=" QQ">
          <i class="icon icon-qq"></i>
        </a>
      </li>
      <li>
        <a class="facebook share-sns" target="_blank" href="https://www.facebook.com/sharer/sharer.php?u=http://amoyiki.com/2017/01/30/Scrapy学习（二）-入门/" data-title=" Facebook">
          <i class="icon icon-facebook"></i>
        </a>
      </li>
      <li>
        <a class="twitter share-sns" target="_blank" href="https://twitter.com/intent/tweet?text=《Scrapy学习（二） 入门》 — 四畳半神话大系&url=http://amoyiki.com/2017/01/30/Scrapy学习（二）-入门/&via=http://amoyiki.com" data-title=" Twitter">
          <i class="icon icon-twitter"></i>
        </a>
      </li>
      <li>
        <a class="google share-sns" target="_blank" href="https://plus.google.com/share?url=http://amoyiki.com/2017/01/30/Scrapy学习（二）-入门/" data-title=" Google+">
          <i class="icon icon-google-plus"></i>
        </a>
      </li>
    </ul>
 </div>


<div class="page-modal wx-share" id="wxShare">
    <a class="close" href="javascript:;"><i class="icon icon-close"></i></a>
    <p>Scan to share on WeChat</p>
    <img src="" alt="WeChat share QR code">
</div>




    <script src="//cdn.bootcss.com/node-waves/0.7.4/waves.min.js"></script>
<script>
var BLOG = { ROOT: '/', SHARE: true, REWARD: true };



</script>

<script src="/js/main.min.js?v=1.4.14"></script>


<div class="search-panel" id="search-panel">
    <ul class="search-result" id="search-result"></ul>
</div>
<template id="search-tpl">
<li class="item">
    <a href="{path}" class="waves-block waves-effect">
        <div class="title ellipsis" title="{title}">{title}</div>
        <div class="flex-row flex-middle">
            <div class="tags ellipsis">
                {tags}
            </div>
            <time class="flex-col time">{date}</time>
        </div>
    </a>
</li>
</template>

<script src="/js/search.min.js?v=1.4.14" async></script>






<script async src="//dn-lbstatics.qbox.me/busuanzi/2.3/busuanzi.pure.mini.js"></script>


</body>
</html>
